Nature Genetics — Latest Matching Preprints

1

Incorporating phenotype heterogeneity in disease GWAS improves power while maintaining specificity

Hof, J. J. P.; Ning, C.; Quinn, L.; Speed, D.

2026-03-27 genetic and genomic medicine 10.64898/2026.03.26.26349370 medRxiv

Top 0.1%

41.0%

Common complex diseases are clinically heterogeneous, yet most genome-wide association studies (GWAS) assume cases are genetically homogeneous. This challenge is compounded in large-scale biobanks, which increasingly combine cases ascertained under different recruitment strategies, raising concerns that heterogeneous case definitions may dilute genetic signal. To address this, we developed StratGWAS, a scalable framework that leverages clinical features of heterogeneity to construct a transformed phenotype that better reflects genetic liability within diseases. StratGWAS stratifies cases using secondary phenotypic information such as age of onset, medication burden, or recruitment definition. StratGWAS then estimates genetic covariance between strata, and derives a transformed phenotype that upweights cases with higher inferred genetic liability. Through simulation studies (N = 100k) and application to the UK Biobank (N = 368k), we show that StratGWAS consistently outperformed standard GWAS methods. Applied to 21 UK Biobank traits, StratGWAS upweighted individuals with earlier disease onset and higher medication burden, yielding respectively 17% and 4% more independent genome-wide significant loci than standard case-control GWAS. Applied to depression, StratGWAS upweighted individuals with multiple diagnoses, greater psychiatric comorbidity, or higher self-reported depressive symptoms, identifying eight additional independent loci compared to case-control GWAS.

2

Cross-omic dissection reveals locus-specific heterogeneity and antagonistic pleiotropy between Alzheimer's disease and type 2 diabetes

Adewuyi, E. O.; Auta, A.; Okoh, O. S.; Selmer, K.; Gervin, K.; Nyholt, D. R.; Pereira, G.

2026-03-25 genetic and genomic medicine 10.64898/2026.03.23.26349030 medRxiv

Top 0.1%

40.9%

Observational studies associate type 2 diabetes (T2D) with increased dementia risk; however, the specificity of this relationship to Alzheimer's disease (AD) and its biological underpinnings remain unresolved. We apply an integrative cross-omic framework to dissect genetic links between AD and T2D. Genome-wide analyses reveal a modest positive genetic correlation and robust polygenic sign concordance of AD with T2D. High-resolution analyses demonstrate locus-specific heterogeneity, with coexisting positive and predominantly negative correlations, and strong inverse associations at APOE and HLA. Cross-trait GWAS meta-analyses indicate that most genome-wide significant signals reflect trait-specific effects, with only a limited set of variants supported in both AD and T2D. Colocalisation reveals distinct causal variants at most shared loci. Gene-based analyses highlight convergence at functional genes, including PLEKHA1, VKORC1, ACE, and APOE, without implying concordant variant-level effects. Bidirectional Mendelian randomisation (MR) shows no evidence of a causal relationship between AD and T2D in either direction. Summary-data MR prioritises genes whose expression or methylation affects both AD and T2D, mostly with opposing effects. Only PLEKHA1 (eQTL) and CAMTA2 (mQTL) show concordant positive associations. Five genes, GALNT10, HSD3B7, BCKDK, KAT8, and ACE, are supported across both regulatory layers, while numerous signals cluster within a regulatory hotspot at 16p11.2, supporting convergent transcriptional and epigenetic involvement, despite directional divergence. These results refine the AD-T2D relationship; rather than a simple shared-risk model, overlap reflects locus-specific heterogeneity and cross-omic convergence often showing opposing effects on AD versus T2D risk, consistent with antagonistic pleiotropy.

3

Phenome-wide association of multiallelic copy number variation in 422,170 UK Biobank individuals reveals novel genetic loci associated with disease

Eisenberg, M.; Packer, R.; Shrine, N.; Demidov, G.; Pack, H.; Hollox, E. J.; Fawcett, K.

2026-06-04 genetic and genomic medicine 10.64898/2026.06.03.26354825 medRxiv

Top 0.1%

39.9%

The contribution of multi-allelic CNVs (mCNVs) to disease risk has not been widely studied. This is largely because they have been difficult to characterise genome-wide at large scale, and are often not strongly associated with flanking SNVs, limiting imputation. Improved understanding of the role of mCNVs in disease risk could lead to novel insights into the pathobiology of disease. We robustly typed 69 mCNVs from UK Biobank whole exome sequences in discovery (n=150,682) and replication sets (n=269,317). Discovery and replication PheWAS used clinically-curated composite phenotypes, integrating self-report, primary and secondary health care data, to interrogate these variants for unrelated British individuals of African, European and Central/South Asian ancestries. 173 mCNV-phenotype associations were detected from 26 mCNVs, of which 114 associations replicated. One of eight potentially novel mCNV-phenotype signals was independent of neighbouring associated SNVs: the association of Sulfotransferase 1A1 and 1A2 genes (SULT1A1/SULT1A2) with estimated glomerular filtration rate (eGFR) in individuals of European ancestry (meta-analysed p=1.05×10⁻⁹, beta=0.016 [0.011; 0.021]). Other potentially novel associations include Golgi phosphoprotein 3 (GOLPH3) with the cardiovascular phenotype bundle branch block in individuals of South Asian ancestry (meta-analysed p=3.35×10⁻⁶, OR=2.13 [1.53; 2.96]) and alpha amylase 2B (AMY2B) with ventricular fibrillation and flutter in individuals of European ancestry (meta-analysed p=2.48×10⁻⁶, OR=1.50 [1.26; 1.78]). In summary, we show that accurate typing of biobank-scale sample sizes can identify associations between traits and mCNVs, acting through a gene dosage relationship. Our work provides several novel likely causative variants contributing to particular traits of clinical importance and immediately suggests a putative functional mechanism for the observed associations.

4

Resolving inflammatory bowel disease risk variants to genes and cell types

Fachal, L.; Zhang, R.; Gettler, K.; Haritunians, T.; Cleynen, I.; Stevens, C. R.; Zhang, Q.; Tastad, C.; Medici, C.; Do, R.; IIBDGC GWAS Group; Abreu, M. T.; Achkarj, J.-P.; Ahmad, T.; Bel Kok, K.; Bernstein, C.; Brooks, J.; Bujanda, L.; Butterworth, J.; Clark, K.; Cummings, F.; D'Amato, M.; Del Buono, J.; Duerr, R. H.; Ellinghaus, D.; Foley, S.; Franchimont, D.; Franke, A.; Hancock, L.; Hart, A.; Hooper, P.; Irving, P.; Jarvis, M.; Johnston, E.; Julia, A.; Kemp, C.; Kennedy, N.; Kupcinskas, J.; Latiano, A.; Lewis, J.; Li, A.; Limdi, J.; Louis, E.; McLaughlin, J.; Moayyedi, P.; Moran, G.; M

2026-05-18 genetic and genomic medicine 10.64898/2026.05.13.26352926 medRxiv

Top 0.1%

37.9%

Inflammatory bowel diseases (IBD), principally Crohn's disease (CD) and ulcerative colitis (UC), are common chronic disorders involving inflammation and often progressive tissue damage. Genome-wide association studies have mapped many risk signals, but the causal variants, effector genes and relevant cellular contexts remain difficult to resolve, limiting mechanistic interpretation and therapeutic translation. Here we performed a multi-ancestry GWAS meta-analysis of 125,992 individuals with IBD and more than 1.2 million controls, identifying 619 independent association signals (374 novel) at 420 IBD regions that account for 77-80% of SNP-based heritability. Fine-mapping resolved 81 high-confidence variants, 41 not previously reported. Although most signals were shared between CD and UC, 39% showed subtype specificity, with UC signals showing stronger enrichment in functional annotations from intestinal epithelial, secretory and enteroendocrine cells, and CD showing stronger genetic correlations with circulating inflammatory biomarkers, including C-reactive protein and glycoprotein acetylation. Latent causal modelling supported a causal effect of decreased high-density lipoprotein on CD risk. By integrating bulk and single-cell eQTL and pQTL resources using colocalisation and Mendelian randomisation, together with coding-variant evidence from exome sequencing, we prioritised 664 candidate effector genes across 341 signals, including 390 newly implicated IBD genes, revealing new biological mechanisms and candidate therapeutic targets supported by human genetics.

5

Genome-wide association and Mendelian randomization analyses link Helicobacter pylori infection to Human Leukocyte Antigen polymorphisms and autoimmune diseases.

Kyosaka, T.; Narita, A.; Kulski, J. K.; Minn, A. K. K.; Miyake, A.; Kotsar, Y.; Hiraide, K.; Ojima, T.; Nakatochi, M.; Namba, S.; Yamaji, T.; Sutoh, Y.; Sasaki, Y.; Broer, L.; Frost, F.; Koyanagi, Y. N.; Kasugai, Y.; Ito, H.; Sawada, N.; Nakano, S.; Suzuki, S.; Hishida, A.; Koyama, T.; Kubo, Y.; Funayama, T.; Makino, S.; Shirota, M.; Takayama, J.; Gocho, C.; Sugimoto, S.; Otsuka-Yamasaki, Y.; Tanno, K.; Abe, Y.; Nakajima, O.; Spaander, M. C. W.; Weiss, S.; Lerch, M. M.; Levy, D.; Hwang, S.-J.; Wood, A. C.; Rich, S. S.; Rotter, J. I.; Taylor, K. D.; Tracy, R. P.; Stocker, H.; Brenner, H.; Leja,

2026-03-30 gastroenterology 10.64898/2026.03.27.26349357 medRxiv

Top 0.1%

34.6%

Helicobacter pylori (H. pylori) infects the gastric epithelium of approximately half of the global population, and is a well-known risk factor for developing gastric cancer. Despite the clinical significance of H. pylori infection, many genetic factors that contribute to susceptibility remain unidentified. While it is well-established that H. pylori infection can result in gastritis and peptic ulcers, which may progress to gastric cancer, its causal link to other diseases remains unclear. We performed a genome-wide association study (GWAS) for anti-H. pylori IgG antibody titers, which were validated as a surrogate marker for H. pylori infection by their correlation with clinical traits, followed by gene-based and pathway analyses, involving up to 140,863 individuals. This included 56,967 in the discovery phase and 68,211 in the replication phase from Japanese cohorts, and an additional 15,685 from European populations in a cross-ancestry meta-analysis. We reveal significant associations between H. pylori infection and polymorphisms in Human Leukocyte Antigen (HLA) genes in the HLA class II region within the Major Histocompatibility Complex (MHC), as well as genes related to innate immunity, including CCDC80, NFKBIZ, TIFA, PSCA, and TRAF3. Mendelian randomization (MR) analysis revealed that genetic liability to H. pylori infection has both positive and negative causal relationships with a variety of diseases, including autoimmune-related diseases such as Type 1 diabetes, Hashimoto's disease, and atopic dermatitis, as well as traits like body height and weight. These genetic findings strongly support the notion that genetic liability to H. pylori infection influences not only gastrointestinal diseases, but also a broader spectrum of health issues, thereby providing valuable insights for public health strategies and personalized medicine approaches.

6

Effect of ancestry and shared genetic architecture of serious mental illness on symptoms and cognition in an admixed Latin American population

Lopera Maya, E. A.; Service, S. K.; Diaz-Zuluaga, A. M.; Castano Ramirez, M.; Mejia, J. C.; Valencia, J.; Teshiba, T.; Freimer, N. B.; Ramirez-Diaz, A. M.; de la Hoz Gomez, J. F.; Valdez, J.; Munoz Umanes, M.; Moore, T. M.; Chapman, S.; Neale, B.; Bearden, C. E.; Escobar, J. I.; Gur, R. C.; Reus, V. I.; Sabati, C. E.; Olde Loohuis, L.; Lopez Jaramillo, C.

2026-05-06 genetic and genomic medicine 10.64898/2026.05.05.26351986 medRxiv

Top 0.1%

33.8%

Most genome-wide association studies (GWAS) of serious mental illness (SMI) have been conducted for categorical diagnoses in samples of primarily European ancestry. The portability of findings to non-Europeans, and to SMI-related symptoms/dimensional traits, remains uncertain. In a sample of 8,666 SMI cases and controls from the Paisa region of Colombia we show that a primarily European schizophrenia GWAS polygenic risk score (PRS) predicted all SMI diagnoses in this sample, as well as symptoms (assessed in cases only) and traits assessed agnostic to SMI diagnosis: a one SD unit (SDU) increase in this PRS was associated, in cases, with decreased risk of suicidal thoughts (OR=0.89, 95% confidence interval 0.84-0.94) and depressed mood (OR=0.90, 95% confidence interval 0.85-0.95) and with increased risk of delusions (OR=1.12, 95% confidence interval 1.06-1.18), and, in cases and controls, with decreased cognition across five distinct domains (average decrease in cognition of 0.065 SDU, p<7e-05). We show that a published European GWAS of cognition predicted levels of executive function (average decrease in cognition of 0.06 SDU per unit increase in PRS, p<2e-04), but not diagnosis or symptoms. Specific loci identified in the SMI GWAS also showed association with multiple diagnoses, symptoms, and cognitive traits in Paisa. The most noteworthy result was for a locus on chromosome 7p22.3, associated in multiple SMI GWAS, that showed association in Paisa with increased risk of bipolar disorder, and with reduced complex cognition and social cognition. Our findings demonstrate wide portability from European GWAS to an admixed American sample, with associations to multiple transdiagnostic phenotypes.

7

Generative AI-assisted Bayesian-frequentist Hybrid Inference in Single-cell RNA Sequencing Analysis for Genes Associated with Alzheimer's Disease

Han, G.; Yuan, A.; Oware, K. D.; Wright, F.; Carroll, R. J.; Smith, M.; Ory, M. G.; Yan, D.; Wang, W.; Sun, Z.; Dai, Q.; Allen, C.; Dang, A.; Liu, Y.

2026-04-20 geriatric medicine 10.64898/2026.04.17.26351142 medRxiv

Top 0.1%

33.2%

Alzheimer's disease genomics and other high-dimensional omics studies demand powerful statistical methods, yet Bayesian inference remains underutilized despite its advantages in small-sample settings, owing to the prohibitive cost of eliciting reliable priors across thousands or millions of parameters. We propose an AI-assisted Bayesian-frequentist hybrid inference framework that couples large language model-based prior elicitation with the hybrid inference theory of Yuan (2009). ChatGPT-4o is queried via a standardized prompt to assess the strength of evidence linking each gene to a disease of interest, and the response is mapped to an informative normal prior via a standardized effect-size calibration. Parameters for covariates of secondary interest are treated as frequentist parameters, preserving efficiency and avoiding sensitivity to mis-specified priors. We derive closed-form hybrid estimators under uniform and conjugate normal priors in linear models, establish their asymptotic equivalence to the frequentist and full Bayes estimators, and show in simulations that hybrid inference using unconditional variance estimation leads to high statistical power while accurately controlling the Type I error rate. Applied to single-cell RNA sequencing data from the ROSMAP cohort for Alzheimer's disease as an example, the framework identifies biologically coherent pathways (such as gamma-secretase pathways) previously undetected. The proposed framework offers a principled and computationally scalable approach to genome-wide Bayesian analysis, with potential for broad application across omics platforms and disease settings.
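To make the role of a conjugate normal prior concrete, the sketch below shows the textbook normal-normal update for a single effect size. It is not the hybrid estimator derived in the preprint; the function name `normal_posterior` and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def normal_posterior(beta_hat, se, prior_mean, prior_sd):
    """Textbook normal-normal conjugate update for one effect size.

    beta_hat, se: frequentist estimate and its standard error.
    prior_mean, prior_sd: informative normal prior (hypothetical values).
    Returns (posterior_mean, posterior_sd).
    """
    data_precision = 1.0 / se ** 2
    prior_precision = 1.0 / prior_sd ** 2
    post_precision = data_precision + prior_precision
    # The posterior mean is a precision-weighted average of data and prior.
    post_mean = (data_precision * beta_hat + prior_precision * prior_mean) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)

# An informative prior pulls the estimate toward the prior mean.
informative = normal_posterior(beta_hat=0.5, se=0.2, prior_mean=0.3, prior_sd=0.2)
# A very diffuse prior leaves the estimate essentially at the frequentist value.
diffuse = normal_posterior(beta_hat=0.5, se=0.2, prior_mean=0.3, prior_sd=1e6)
print(informative)
print(diffuse)
```

With equal data and prior precision the posterior mean sits halfway between the estimate and the prior mean, while the diffuse prior returns approximately the frequentist estimate, which matches the intuition behind the asymptotic equivalence the abstract mentions.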

8

Genetics of cannabis ever-use and frequency across ancestries implicate novel loci and brain-specific biology

Pasman, J. A.; Gerring, Z. F.; Thorp, J. G.; Abdellaoui, A.; Youssef, P.; Ori, A.; Smadi, M.; Thijssen, A. B.; Woodward, D.; Wormington, B.; Adkins, D. E.; Aliev, F.; Chatzinakos, C.; Elson, S. L.; Fontanillas, P.; Gizer, I. R.; Gu, H.; Hines, L. A.; Johnson, E. C.; Koiv, K.; Lind, P. A.; Mosing, M. A.; Nolte, I. M.; Ong, J.-S.; Otto, J. M.; Palviainen, T.; Peterson, R. E.; Sallis, H. M.; Shabalin, A. A.; Shin, J.; Thomas, N. S.; van der Laan, C. M.; van der Most, P. J.; van Dorsselaer, S.; van Eijk, K. R.; Wootton, R. E.; Wo

2026-04-27 public and global health 10.64898/2026.04.25.26351611 medRxiv

Top 0.1%

33.0%

Cannabis use is widespread, with genetic differences partly explaining variation in individual patterns of use. We performed the largest-to-date genome-wide association study (GWAS) meta-analysis of cannabis ever-use (N=736,322, 76% European ancestry) and various measures of frequency of use (N=269,160 cannabis users, 84% European ancestry). We identified 54 independent genome-wide significant loci for ever-use and 6 for frequency and show that the genetic architectures of ever-use, frequency, and cannabis use disorder (CUD) are overlapping but distinguishable. We identified 63 loci that were associated with common liability (All-cannabis) to different cannabis use traits in European-ancestry individuals. Across analyses, we identified 75 unique loci that had not previously been implicated in cannabis use. Gene prioritization analyses identified 349 genes for ever-use, 5 genes for frequency of use, and 429 for All-cannabis, including previously identified and novel genes. We found enrichment of genetic signals for cannabis use in biologically meaningful categories and relevant human brain cell types, including excitatory neuronal populations. There were substantial genetic correlations between cannabis use and a range of psychiatric disorders and substance use traits, while cannabis polygenic scores were associated with increased risk of psychiatric disorders. Mendelian Randomization showed evidence for (bidirectional) causal associations between cannabis use and ADHD, bipolar disorder, schizophrenia and PTSD.

9

Genetic determinants of cytokine production in activated human monocytes

Gilchrist, J. J.; Mentzer, A. J.; Jostins, L.; Makino, S.; Naranbhai, V.; Danielli, S.; Nassiri, I.; Knight, J. C.; Fairfax, B. P.

2026-05-13 genetic and genomic medicine 10.64898/2026.05.08.26352736 medRxiv

Top 0.1%

33.0%

Monocyte function plays a central role in human health and mapping the genetic determinants of monocyte gene expression has provided insights into numerous disease processes. The relationship between genetic variation and functional cytokine secretion in response to immune stimuli remains poorly characterised, however. To address this, we have quantified the production of 28 cytokines by monocytes from 366 healthy, European-ancestry donors following activation with lipopolysaccharide (LPS) and interferon gamma (IFNγ). By integrating these data with genomic and transcriptomic data from the same cells we robustly define the regulatory determinants of monocyte cytokine secretion. We identify four genome-wide significant loci affecting monocyte cytokine release, observing both cis and trans regulatory effects on cytokine release. These loci include multi-cytokine trans regulatory activity of the CCR5-Δ32 deletion on secretion of the CCR5-binding cytokines, MIP-1β and RANTES, and a cis regulator of PDGF-BB secretion, which colocalises with GWAS risk loci for ulcerative colitis and primary biliary cirrhosis. We further map the genetics of co-expression to establish relationships between RNA transcription and cytokine protein secretion. In doing so we identify marked enrichment of genes related to lipid metabolism in gene regulatory networks linked to cytokine secretion and identify that the COVID-19 risk locus at OAS1 uncouples OAS1 RNA expression from the secretion of 10 cytokines in response to LPS stimulation.

10

A genome-wide deletion map in 125,730 individuals for novel rare disease gene and variant discovery

McGuigan, A.; Pagnamenta, A. T.; Covill, L. E.; Sampson, J.; Camps, C.; Chen, Y.; Moitra, T.; Chundru, V. K.; O'Heir, E.; Allan, K.; Arno, G.; Broomfield, A.; Delatycki, M.; Lin, S.; Michaelides, M.; Rius, R.; Roscioli, T.; Simons, C.; Webster, A.; White, S. M.; Wilson, L.; Sanders, S. J.; O'Donnell-Luria, A.; Ellingford, J. M.; Taylor, J. C.; Whiffin, N.

2026-05-15 genetic and genomic medicine 10.64898/2026.05.13.26352722 medRxiv

Top 0.1%

32.1%

Structural variants (SVs) can disrupt gene function and contribute to pathogenesis of rare disorders. Here, we created a genome-wide knockout dataset across 125,730 individuals with genome sequencing data in the UK's National Genomic Research Library by leveraging the distinct read-depth signal associated with homozygous deletions. We curated 535,699 high-confidence homozygous deletion SVs, of which 48,735 were rare. These deletions collectively covered 213Mb or 6.92% of the human genome (4.58% of autosomal sequence), revealing substantial tolerance to complete sequence loss. From a subset of 58,022 individuals with rare disease, we identified 295 individuals with likely diagnostic homozygous deletions impacting protein-coding regions of known disease genes. A further 32 individuals had candidate non-coding SVs in or near known disease genes, 19/32 (59.37%) of which disrupted 5'-UTR/promoter regions, revealing promoter deletion as an underappreciated cause of rare disorders. Finally, we identify 43 genes with no known rare-disease association but with exonic homozygous deletions in two or more individuals with consistent phenotypes. We describe in detail PDC (phosducin) in Leber Congenital Amaurosis, GCG (glucagon) for a syndromic neurodevelopmental disorder with gastrointestinal involvement, and ENTPD3 for intellectual disability with autism, as candidate novel disease-associated genes. Overall, we create a genome-wide map of homozygous deletions and demonstrate the power of this dataset for rare disease diagnosis and novel disease-gene discovery.

11

Optimizing phenotype scale improves genetic analyses in large-scale biobanks

Huang, Z.; Costantino, M.; Dahl, A.

2026-05-07 genetics 10.64898/2026.05.04.722531 medRxiv

Top 0.1%

28.7%

Large-scale biobanks have enabled increasingly complicated genetic analyses across thousands of phenotypes. However, studies rarely consider the appropriate phenotype measurement scale, a problem that can drastically affect inferences on genetic architecture. Here, we introduce SIQReg, a practical solution to this classical problem, which learns a data-driven phenotype scale by minimizing heterogeneity across phenotype quantiles. Applied to complex traits in UK Biobank, SIQReg rejects the default scale for 24/25 traits. Generally, SIQReg scales lie between default and logarithmic, indicating that default-scale traits are neither purely additive nor purely multiplicative. We show that SIQReg improves both non-additive and additive genetic analyses. SIQReg eliminates most non-additive genetic signals (such as 97% of vQTL and 76% of quantile-dependent TWAS genes), indicating they may be statistical artifacts, while preserving biologically plausible non-additive signals. Simultaneously, SIQReg improves power to detect additive signals, increasing GWAS loci, TWAS genes, and PGS prediction accuracy by 11%, 13%, and 10%, respectively, and identifies 50% more high-risk individuals. These gains replicate across ancestry groups. Our results establish SIQReg as a principled approach to phenotype scale transformation that improves genetic analyses of complex traits.

12

SNPic: SNP Topic Modeling for Interpretable Clustering of Complex phenotypes

Leyi, Z.; Seiler, C.; Speed, D.; Micheroli, R.; Ospelt, C.

2026-04-24 genetics 10.64898/2026.04.22.720106 medRxiv

Top 0.1%

27.7%

Genome-wide association studies (GWAS) have cataloged thousands of disease-associated variants, yet a central challenge remains: decoding the shared, pleiotropic architecture that links complex phenotypes. Existing approaches, including dimensionality reduction methods and regression genetic models, either lack interpretability or rely on external linkage disequilibrium (LD) reference panels, limiting their ability to recover coherent biological mechanisms. Here we introduce the SNP topic model (SNPic), a generative probabilistic framework that reframes GWAS summary statistics as a structured corpus and models genetic architecture using principles from Natural Language Processing (NLP). By treating phenotypes as documents and genes or the whole corpus of traits as words, SNPic applies topic models, e.g. Latent Dirichlet Allocation (LDA), to infer latent "genetic topics", representing interpretable, overlapping biological modules that jointly explain complex traits. This formulation enables simultaneous reconstruction of trait relationships and identification of their underlying molecular drivers. SNPic integrates two complementary schemes: Sumstat-as-word for capturing global phenotypic structure and Gene-as-word for resolving mechanistic detail, within a unified modeling framework. To ensure robustness, we introduce a stability-optimized inference pipeline based on bootstrap resampling, allowing data-driven selection of topic number and filtering of stochastic signals. Across extensive simulations, SNPic consistently outperforms conventional dimensionality reduction methods in recovering latent structure under both linear and non-linear, highly overlapping genetic architectures. Applied to integrated FinnGen and UK Biobank datasets, SNPic identifies reproducible genetic topics corresponding to distinct biological programs, including HLA-mediated immune processes and transporter-driven metabolic regulation, with strong tissue-specific support. The framework further generalizes across species, organizing complex traits in maize, Arabidopsis thaliana, and cattle into biologically coherent modules. Together, these results establish SNPic as a scalable and interpretable framework that shifts GWAS analysis from association cataloging toward the construction of an interpretable knowledge graph representing the latent semantic architecture of the genome. By unifying statistical genetics with NLP, SNPic reframes GWAS analysis as a probabilistic language modeling task, enabling the systematic decoding of complex trait architectures and delivering a systemic graph of cross-phenotype relationships.

13

The New York Genome Center ALS Consortium resource combines postmortem tissue transcriptomics with whole genome sequencing to empower biological discovery

Humphrey, J.; Oku, A.; Byrska-Bishop, M.; Basile, A.; Evani, U. S.; Corvelo, A.; Tokolyi, A.; BP, K.; Real, A.; Kim, Y.; Bond, M.; Clarke, W. E.; Fu, R.; Geiger, H.; Chang, S.; Naito, T.; Jang, B.; Musunuri, R.; Dredge, W.; Al-Abri, R.; Hoover, B. N.; Manaa, D.; McClintock, J.; Singh, F. P.; Pedersen, M. H.; Runnels, A.; Propp, N.; Fennessey, S.; Won, H.-H.; Zody, M. C.; Narzisi, G.; Robine, N.; Lappalainen, T.; Fagegaltier, D.; Gursoy, G.; Knowles, D. A.; Raj, T.; NYGC ALS Consortium; Harms, M. B.; Phatnani, H.

2026-05-04 neurology 10.64898/2026.04.29.26350889 medRxiv

Top 0.1%

27.6%

Amyotrophic lateral sclerosis (ALS) is a devastating neurodegenerative disease with substantial genetic and clinical heterogeneity that impedes therapeutic development. Large-scale multi-tissue genomic resources have transformed the study of neuropsychiatric and neurodegenerative diseases, but no equivalent resource exists for ALS. Here we present the full NYGC ALS Consortium dataset, combining whole-genome sequencing from 4,746 donors and bulk RNA-seq from 2,574 samples across 8 brain and spinal cord regions from 695 donors across the ALS disease spectrum. Our catalogue of small variants, structural variants, and short tandem repeats identified likely pathogenic mutations in 15.6% of ALS cases. Gene expression and mRNA splicing analysis across 5 major tissues reveals shared and region-specific features, highlighting microglial and T-cell dysregulation in the spinal cord. Mapping the genetic regulation of expression and splicing across tissues identified associations with 6 ALS risk loci, whereas allele-specific rare variant analysis detected expression effects for C9orf72 and OPTN. All data are immediately publicly available.

14

Modeling cis-regulatory variation in human brain enhancers across a large Parkinson's Disease cohort

Sigalova, O. M.; Pancikova, A.; De Man, J.; Theunis, K.; Hulselmans, G. J.; Konstantakos, V.; Stuyven, B.; De Brabandere, A.; Geurts, J.; Mikorska, A.; Mukherjee, S.; Abouelasrar Salama, S.; Vandereyken, K.; Davie, K.; Mahieu, L.; Adler, C. H.; Beach, T. G.; Serrano, G. E.; Voet, T.; Demeulemeester, J.; Aerts, S.

2026-03-19 genomics 10.64898/2026.03.15.711881 medRxiv

Top 0.1%

27.5%

Genome-wide association studies (GWAS) have linked more than a hundred non-coding genomic loci to Parkinson's disease (PD) risk. Deciphering their functional impact on gene regulation requires cell type-aware modeling approaches to assess the effects of sequence variation on enhancer function and target gene expression. To address this challenge, we generated a comprehensive matched dataset from 190 human donors (115 controls and 75 PD), comprising long-read whole-genome sequencing alongside single nucleus multiome atlases (snATAC-seq and snRNA-seq for 3.1 and 1.1 million nuclei respectively) of the anterior cingulate cortex and substantia nigra. By integrating chromatin accessibility quantitative trait loci (caQTL), DNA methylation QTL (meQTL), and allele-specific chromatin accessibility (ASCA), we identified 53,841 high-confidence cis-acting genetic variants that modulate cell type-specific enhancer accessibility in one or both brain regions. We then demonstrate that sequence-to-function models can accurately predict the impact of these variants directly from the genomic sequence. Novel explainability approaches allowed stratifying these variants according to their regulatory function, with the majority disrupting specific transcription factor binding sites in a cell type-specific manner. Integrating these "enhancer variants" (EV) with eQTL mapping and gene locus modeling linked a subset of EVs to their target genes. Finally, we applied these models to prioritize regulatory variants at known PD GWAS loci, bypassing statistical limitations in rare disease-relevant populations like dopaminergic neurons. Altogether, we establish a unique resource and new sequence modeling strategies to interpret functional non-coding variation in the human brain.

15

Integrating 730,947 exome sequences with clinical literature improves gene discovery

Guez, J.; Goodrich, J. K.; Moldovan, M. A.; Chao, K. R.; Kar, P.; Panchal, R.; Wilson, M. W.; Laricchia, K. M.; Rohlicek, G.; Biba, D.; Marten, D.; He, Q.; Darnowsky, P. W.; Grant, R.; Weisburd, B.; Baxter, S. M.; Nadeau, J.; Lu, W.; Jahl, S.; Parsa, S.; Lamane, A.; DiTroia, S.; Fu, J.; Zhao, X.; Alarmani, E.; Tolonen, C.; Novod, S.; Bryant, S.; Stevens, C.; Chapman, S. B.; Cusick, C.; Vittal, C.; Gauthier, L. D.; Goldstein, J. I.; Goldstein, D.; King, D.; gnomAD Project Consortium; Tranchero, M.; Lotter, W.; MacArthur, D. G.; Brand, H.; Seplyarskiy, V.; Koch, E.; Talkowski, M. E.; Solomons

2026-03-25 genetic and genomic medicine 10.64898/2026.03.23.26349081 medRxiv

Top 0.1%

27.5%

Accurate estimates of allele frequencies aid in genetic discovery, including rare disease diagnosis, common disease investigations, and population genetics. Here, we present the Genome Aggregation Database version 4 (gnomAD v4), comprising 807,162 sequenced individuals including 730,947 exomes, a fivefold increase over previous releases, and 76,215 genomes. We demonstrate that statistical power to detect strong selective constraint continues to increase with sample size. We develop a new loss-of-function annotation pipeline, which learns genomic features predictive of nonsense-mediated decay and splicing effects from selection signals, achieving 90% precision for distinguishing likely true versus false positive loss-of-function variants. This improved pipeline, along with incorporation of highly deleterious missense variants into measures of loss-of-function intolerance, improves disease gene detection particularly for short genes and those with gain-of-function mechanisms. To improve disease gene prediction, we systematically extract gene-disease associations from biomedical literature, map these to gene-level biological features, and integrate both with refined constraint metrics within a Bayesian framework, yielding state-of-the-art prediction of gene-disease relevance. Building on this integration, we define a Discovery Potential (DisPo) score that highlights genes under strong constraint but limited clinical characterization. High-DisPo genes are enriched in embryonic lethal and fertility phenotypes, supporting DisPo as a tool to prioritize previously under-characterized disease genes. Together, these advances establish a unified framework for accelerating gene discovery and improving rare disease diagnosis.

16

Wavelet Decomposition-Based Genomic Analysis of the Human Electrocardiogram

Zainana, S.; Lauer, L. P.; Kiiskinen, T.; Tibshirani, R. J.; Hastie, T.; Ashley, E.; O'Sullivan, J. W.; Rivas, M. A.

2026-05-24 cardiovascular medicine 10.64898/2026.05.20.26353725 medRxiv

Top 0.1%

26.6%

The electrocardiogram (ECG) encodes the electrical activity of the heart across multiple timescales, yet standard clinical analysis collapses this rich signal into a handful of scalar measurements that discard most of the waveform's structure. Whether the frequency signals lost in this reduction carry heritable biological information relevant to cardiovascular disease risk remains unclear. Here we decompose resting 12-lead ECGs from 47,052 White British UK Biobank participants into 84 frequency-specific energy features using Daubechies-6 wavelet analysis across 12 leads and 7 decomposition levels, and perform independent genome-wide association analyses on each feature. We identify 67 independent loci and refine these to 101 high-confidence causal variants (posterior inclusion probability > 0.80) through Bayesian fine-mapping; associated loci converge on genes governing cardiac conduction and myocardial integrity, including SCN5A, TTN, KCNQ1, and DSP, alongside less-characterized cardiomyopathy candidates. SNP-based heritability estimates range from 0.03 to 0.26, with the strongest signals in mid-frequency bands (D6-D4, ~4-32 Hz) of Lead I and aVR, and strong inter-lead genetic correlations indicate a coordinated genetic architecture underlying the waveform. Integrating these features with FinnGen R12 cardiovascular phenotypes reveals genetic correlations reaching 0.56 with heart failure, driven predominantly by energy in the highest-frequency band (D1, 125-250 Hz), a spectral range routinely filtered from clinical ECGs and previously regarded as acquisition noise. These results reframe the electrocardiogram as a multi-frequency genetic phenotype, expand the set of cardiac loci discoverable from ECG data, and implicate high-frequency cardiac electrical activity as an underexplored dimension of cardiovascular disease risk.
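The frequency bands quoted in this abstract follow the usual dyadic rule for a discrete wavelet transform: detail level j nominally covers fs / 2^(j+1) to fs / 2^j. The sketch below only checks that arithmetic. The 500 Hz sampling rate is an assumption inferred from the stated D1 band (125-250 Hz), and treating the seven features per lead as detail levels D1 to D7 is likewise an assumption; the abstract does not state either, and the function name is hypothetical.

```python
def detail_band_hz(level, sampling_rate_hz):
    """Nominal (low, high) frequency band in Hz of wavelet detail level `level`."""
    high = sampling_rate_hz / 2 ** level
    low = sampling_rate_hz / 2 ** (level + 1)
    return low, high


SAMPLING_RATE_HZ = 500.0  # assumption: inferred from the quoted D1 band, not stated in the abstract
N_LEADS = 12              # from the abstract
N_LEVELS = 7              # from the abstract; assumed here to be detail levels D1..D7

# Nominal band for each detail level
bands = {f"D{j}": detail_band_hz(j, SAMPLING_RATE_HZ) for j in range(1, N_LEVELS + 1)}

# One energy feature per (lead, level) pair
n_features = N_LEADS * N_LEVELS

for name, (low, high) in bands.items():
    print(f"{name}: {low:g}-{high:g} Hz")
print("features:", n_features)
```

Under these assumptions, D1 spans 125-250 Hz and D6 through D4 together span roughly 3.9-31.25 Hz, which matches the "~4-32 Hz" mid-frequency range in the abstract, and 12 leads times 7 levels gives the 84 features reported.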

17

A pan-cancer regulatory atlas of 6,983 GWAS variants prioritizes recurrent regulatory annotations and candidate programs at cancer risk loci

Dutta, S.

2026-05-20 genetic and genomic medicine 10.64898/2026.05.16.26353369 medRxiv

Top 0.1%

26.1%

Genome-wide association studies have identified thousands of cancer risk variants in non-coding regions, yet their regulatory mechanisms remain largely uncharacterized. Here we present a regulatory annotation atlas of 6,983 genome-wide significant variants across 23 cancer types, scored using multimodal AlphaGenome predictions and integrated with ENCODE-4, Roadmap Epigenomics, and JASPAR 2024 annotations. Most variants (70.5%) fall outside annotated cis-regulatory elements; 27.7% overlap enhancers and 1.4% overlap promoters. Comparison with 6,626 position-matched eQTL control variants suggests that enhancer-classified variants carry 1.86-fold higher predicted effects (P = 1e-94) and promoter variants 7.84-fold (P = 2.5e-19). A composite prioritization score (RegVar-basic, excluding GWAS-derived pleiotropy and TF disruption, AUC = 0.650; RegVar-full, AUC = 0.675) outperforms CADD (0.499) and LINSIGHT (0.558) in this cancer-gene discrimination benchmark. Within-locus ranking across 2,626 GTEx DAP-G eQTL credible sets shows that RegVar identifies the highest-posterior-probability variant in 47.3% of loci (P = 7.0e-13), while CADD performs at chance. Predicted target genes show 67.7% concordance with GTEx eQTL assignments. Permutation-controlled motif analysis highlights NFKB1, STAT1, IRF1, and ARNT as exploratory permutation-enriched candidate transcription factors at cancer risk loci. This atlas provides a resource for interpreting non-coding cancer susceptibility variants. Because AlphaGenome uses expression-related training data, GTEx-based validations should be interpreted as partially orthogonal rather than fully independent.

18

Whole Genome Sequencing Reveals a RET Enhancer Risk Haplotype Associated with Hirschsprung Disease in Mowat Wilson Syndrome

Collins, S.; Bah, I.; Pysar, R.; Mowat, D.; Turner, T. N.; Chatterjee, S.

2026-03-23 gastroenterology 10.64898/2026.03.19.26348831 medRxiv

Top 0.1%

26.0%

Mowat Wilson syndrome (MWS) is a rare neurodevelopmental disorder caused by mostly heterozygous loss-of-function variants in ZEB2. Affected individuals show considerable variability in clinical presentation. In particular, Hirschsprung disease (HSCR) occurs in only a subset of patients, suggesting that additional genetic factors may modify disease penetrance. To investigate this possibility, we performed whole-genome sequencing of two parent-child trios in which the probands carried pathogenic de novo ZEB2 variants but differed in enteric phenotype: one individual with MWS and long-segment HSCR and another with MWS without HSCR. In both probands, the ZEB2 variants represent the primary causative genomic diagnosis, and no additional rare coding variants or excess copy-number burden provided a clear alternative explanation for HSCR. Phasing of a previously defined RET enhancer haplotype comprising 10 single-nucleotide polymorphisms (SNPs) revealed inheritance of a high-risk haplotype in the proband with HSCR, whereas the proband without HSCR carried only low-risk haplotypes on both chromosomes. To place these findings in a developmental context, we analysed single-cell transcriptomic data from the developing human fetal gut and neocortex. ZEB2 and RET show overlapping expression in enteric neural crest progenitors and neuroblasts but minimal overlap in the developing neocortex, indicating that reduced RET dosage is likely to have tissue-specific effects in the enteric nervous system. Together, these results support a model in which common regulatory variation at RET modifies HSCR penetrance in the setting of ZEB2 haploinsufficiency. More broadly, our findings illustrate how whole-genome sequencing can reveal regulatory modifiers that contribute to variable expressivity in ostensibly monogenic disorders.

19

Cellular morphology emerges from polygenic, distributed transcriptional variation

Paylakhi, S.; Geurgas, R.; Yasko, A.; Wedow, R.; Tegtmeyer, M.

2026-03-13 genetics 10.64898/2026.03.12.711281 medRxiv

Top 0.1%

25.6%

Height and most disease risk are known polygenic traits: characteristics governed by multiple genes at different loci instead of a select few. Though we are beginning to understand how genetic variation impacts cell morphology, whether an analogous polygenic architecture operates at the cellular level, where morphology integrates cytoskeletal organization, organelle positioning, and metabolic state, has yet to be systematically tested. Here, we demonstrate that cellular morphology behaves as a polygenic trait by integrating multimodal modeling, perturbation profiling, and population-scale genetic variation. A shared latent-space autoencoder trained on four large-scale perturbation datasets predicts morphology from gene expression and generalizes without retraining to matched RNA-seq and Cell Painting profiles from 100 genetically diverse iPSC donors. The model predicted 17 morphological features (R² > 0.6, permutation FDR q < 0.05), enriched for spatial organelle distribution and cytoskeletal architecture. Predictive performance does not arise from dominant gene-phenotype relationships: individual genes contribute modestly, and marginal gene-morphology correlations are uniformly weak, revealing a distributed regulatory architecture. Despite this polygenicity, CRISPR perturbation data from the JUMP consortium validates specific model-prioritized genes, such as the cytoskeletal regulator TIAM1, membrane trafficking factor RAB31, and mitochondrial-associated membrane transporter ABCC5, as molecular anchors whose disruption produces feature-specific morphological shifts. Transcriptome-wide association analyses identify correlational variant-gene-morphology chains linking cis-regulatory variation through mitochondrial metabolism (PDHX) and iron transport (SLC11A2) to cellular architecture.
These results establish cellular morphology as a polygenic systems phenotype, extending the omnigenic framework to the cellular level and providing a biological basis for interpreting cross-modal prediction in functional genomics.

20

The Biobank Rare Variant consortium powers the discovery of rare genetic associations through global collaboration

Palmer, D. S.; Hill, B.; Hodgson, S.; Joeloo, M.; Kalantzis, G.; Kousathanas, A.; Koyama, S.; Lu, W.; Namba, S.; Rodriguez, Z. B.; Shortt, J. A.; Sonehara, K.; Vartanian, N.; Vy, H. M. T.; Wade, I. A.; White, S. L.; Baya, N. A.; Chami, N.; Do, R.; Estrada, K.; Finer, S.; Genovese, G.; Guez, J.; Itan, Y.; Kanai, M.; Lassen, F. H.; Matsuda, K.; Moutsianas, L.; Peloso, G. M.; Priit, P.; Rader, D. J.; Rendon, A.; Rocheleau, G.; Sadeghi-Alavijeh, O.; Selvaraj, M. S.; Smit, R. A.; Wang, D.; Wigdor, E. M.; Yu, Z.; Colorado Center for Personalized Medicine, ; Estonian Biobank Research Team, ; Genes

2026-05-24 genetic and genomic medicine 10.64898/2026.05.21.26353759 medRxiv

Top 0.1%

25.4%

Rare coding variants can have large effects on disease risk and provide direct routes from human genetics to disease mechanisms and therapeutic targets, but their discovery is constrained by sample size, particularly for low-prevalence diseases. Here we establish the Biobank Rare Variant Analysis (BRaVa) consortium, a global rare variant association resource that integrates sequencing and linked health-record data from ten biobanks and cohorts comprising over 1.2 million individuals across diverse ancestries. We performed gene-based meta-analyses of rare coding variation across 33 clinical endpoints and 11 quantitative traits. Aggregating evidence across biobanks and ancestries identified 514 gene-trait associations, including 31 not reported in prior studies or curated association resources following systematic literature review. Notably, 36.1% of gene-level associations were undetectable in any individual biobank, and 91 emerged only through cross-ancestry meta-analysis, demonstrating that federated integration enables discovery beyond the reach of single cohorts. Similar gains were observed at the variant level, where 25.0% of phenotype-locus associations were detectable only through meta-analysis. Effect size estimates were correlated across ancestries with concordant directions of effect, supporting the generalizability of rare variant associations. The identified signals implicate pathways involved in transcriptional and epigenetic regulation, metabolism, vascular and epithelial biology, and immune function, highlighting rare coding variation as an engine for biological discovery across medical record phenotypes. For example, damaging variation in ANKRD12 implicates inflammatory transcriptional dysregulation in asthma and chronic obstructive pulmonary disease, and ultra-rare predicted loss-of-function variants in NAA15 link protein acetylation processes to type 2 diabetes risk.
BRaVa establishes a scalable framework and freely available community resource for rare variant meta-analysis across global biobanks. Public release of gene- and variant-level association summary statistics provides a reference map of rare coding variant associations to support disease gene discovery, biological interpretation, and therapeutic target prioritization as sequencing-linked health-record resources continue to expand.